Your world-renowned work in cancer research, sports recruiting, advertising, and environmental research has led the government to reach out and ask for help in better understanding the US populace. Your goal is to build a classifier that can predict income levels in order to create more effective policy.

The dataset below includes Census data on 32,000+ individuals with a variety of variables and a target variable for above or below 50k in salary.

Your goal is to build a Random Forest Classifier to be able to predict income levels above or below 50k.

Calculate the base rate

# base rate = share of the positive class (income above 50k); filter() comes from dplyr
n_above_50k <- nrow(filter(census, income == "1"))
baserate <- n_above_50k / nrow(census)
baserate
## [1] 0.2406342
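
Equivalently (a quick sketch, relying on income being coded with levels "0" and "1", as the filter above implies), the class shares can be read straight off a frequency table:

prop.table(table(census$income))   #<- the share of "1" is the base rate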

Models

Initial RF Model

Initial mtry

Calculate the initial mtry level

mtry_initial <- round(sqrt(ncol(census) - 1))   #<- square root of the number of predictors
mtry_initial
## [1] 4
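
For comparison, randomForest's own default for classification is the floored square root of the number of predictors, which would give 3 here rather than the rounded 4 used above:

floor(sqrt(ncol(census) - 1))   #<- randomForest's default mtry heuristic for classification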

Running initial model

Run the initial RF model with 500 trees
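
The fitting call itself is not echoed in the chunk; reconstructed from the Call line of the output below (hyperparameter values copied directly from it, any set.seed call not recoverable), it would look like:

library(randomForest)

census_RF <- randomForest(income ~ .,            #<- income as a function of all other variables
                          data = census_train,
                          ntree = 500,           #<- number of trees to grow
                          mtry = mtry_initial,   #<- variables tried at each split
                          replace = TRUE,        #<- sample with replacement
                          sampsize = 100,        #<- size of each bootstrap sample
                          nodesize = 5,          #<- minimum size of terminal nodes
                          importance = TRUE,     #<- track variable importance
                          proximity = FALSE,     #<- skip the (large) proximity matrix
                          norm.votes = TRUE,     #<- report votes as fractions
                          do.trace = TRUE,       #<- print progress while growing
                          keep.forest = TRUE,
                          keep.inbag = TRUE)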

census_RF
## 
## Call:
##  randomForest(formula = income ~ ., data = census_train, ntree = 500,      mtry = mtry_initial, replace = TRUE, sampsize = 100, nodesize = 5,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 15.67%
## Confusion matrix:
##       0    1 class.error
## 0 20561 1261  0.05778572
## 1  3248 3711  0.46673373

Initial Model Analysis

Review the percentage of trees that voted for each data point to be in each class.

census_votes <- as.data.frame(census_RF$votes)   #<- fraction of trees voting for each class
kable(head(census_votes))   #<- kable() comes from the knitr package
        0         1
0.8740000 0.1260000
0.4088176 0.5911824
0.9639279 0.0360721
0.6860000 0.3140000
0.5322581 0.4677419
0.3400402 0.6599598

Review the “predicted” component, which contains each data point's predicted class (shown alongside the vote shares for comparison).

head(as.data.frame(census_RF$votes))
##           0          1
## 1 0.8740000 0.12600000
## 2 0.4088176 0.59118236
## 3 0.9639279 0.03607214
## 4 0.6860000 0.31400000
## 5 0.5322581 0.46774194
## 6 0.3400402 0.65995976
head(as.data.frame(census_RF$predicted))
##   census_RF$predicted
## 1                   0
## 2                   1
## 3                   0
## 4                   0
## 5                   0
## 6                   1
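
As a sanity check (a sketch; max.col breaks exact ties at random, which rarely matters here since norm.votes = TRUE gives clean fractions), the predicted label should simply be the class with the larger vote share:

head(colnames(census_RF$votes)[max.col(census_RF$votes)])   #<- majority-vote class per row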

Determine the most important variable for your model

census_importance_DF <- as.data.frame(census_RF$importance)
kable(census_importance_DF[-c(1:2)])
                 MeanDecreaseAccuracy   MeanDecreaseGini
age                         0.0086944          3.1452175
workclass                   0.0023666          1.6949648
fnlwgt                      0.0003535          2.4118733
education                   0.0116288          3.3959073
education_num               0.0117018          2.3795816
marital_status              0.0308497          3.2450132
occupation                  0.0137208          4.8070253
relationship                0.0274160          3.2195532
race                        0.0001544          0.3665042
sex                         0.0018075          0.3422816
capital_gain                0.0146781          2.4104034
capital_loss                0.0014807          0.5492761
hours_per_week              0.0043147          1.7988305
native_country             -0.0000891          0.4230629

Occupation is the most important variable, with the highest mean decrease in Gini index (4.8070253).
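
The same answer can be pulled programmatically instead of read off the table; a minimal sketch:

# variable with the largest mean decrease in Gini
rownames(census_importance_DF)[which.max(census_importance_DF$MeanDecreaseGini)]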

Visualize the model using plotly, with lines tracking the overall OOB error rate and the error rates for the two classes of the target variable. What patterns do you notice?

census_RF_error = data.frame(1:nrow(census_RF$err.rate),  #<- tree index (1 to ntree)
                                census_RF$err.rate)

colnames(census_RF_error) = c("Number of Trees", "Out of the Bag",
                                 "<=50K", ">50K")

# gap between the two class-level error rates
census_RF_error$Diff <- census_RF_error$`>50K` - census_RF_error$`<=50K`

library(plotly)

fig <- plot_ly(x = census_RF_error$`Number of Trees`, y = census_RF_error$Diff,
               name = "Diff", type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = census_RF_error$`Out of the Bag`, name = "OOB_Er")
fig <- fig %>% add_trace(y = census_RF_error$`<=50K`, name = "<=50K")
fig <- fig %>% add_trace(y = census_RF_error$`>50K`, name = ">50K")

fig

All of the error curves have high variance when the number of trees is small, but they level out as the number of trees grows. In addition, the error rate for the >50K class is much higher than for the <=50K class.

Pull the confusion matrix out of the model and discuss the results

census_RF$confusion
##       0    1 class.error
## 0 20561 1261  0.05778572
## 1  3248 3711  0.46673373
census_RF_acc = sum(census_RF$confusion[row(census_RF$confusion) == 
                                                col(census_RF$confusion)]) / 
  sum(census_RF$confusion)

census_RF_acc
## [1] 0.8433188

The OOB accuracy of the model is 84.33%.
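
As a cross-check, accuracy can also be computed as one minus the final OOB error rate from the err.rate matrix; this also sidesteps a subtlety above, where sum(census_RF$confusion) quietly includes the class.error column in the denominator:

1 - census_RF$err.rate[nrow(census_RF$err.rate), "OOB"]   #<- 1 - 0.1567, i.e. ~0.8433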

Determine the number of trees at which the model's out-of-bag error is lowest

min_OOB <- min(census_RF_error$`Out of the Bag`)
number_trees <- which(census_RF_error == min_OOB, arr.ind = TRUE)[,1]
number_trees
## row 
##  93

Using the err.rate matrix, select the number of trees that appears to minimize your error.

min_OOB_matrix <- min(census_RF$err.rate[,1])   #<- column 1 is the OOB error
number_trees_matrix <- which(census_RF$err.rate == min_OOB_matrix, arr.ind = TRUE)[,1]
number_trees_matrix
## row 
##  93
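
A more direct route to the same answer is which.min on the OOB column:

which.min(census_RF$err.rate[, "OOB"])   #<- row index = number of trees at the minimum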

Min OOB Model

Running Min OOB Model

Using the visualizations, the OOB error, and the errors for the positive and negative classes, select a new number of trees and rerun the model.
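
As with the first model, the call is reconstructed from the Call line of the output below; note that, per that output, sampsize was also raised from 100 to 300 alongside the change in ntree:

census_RF_2 <- randomForest(income ~ .,
                            data = census_train,
                            ntree = 93,            #<- tree count at the minimum OOB error
                            mtry = mtry_initial,
                            replace = TRUE,
                            sampsize = 300,        #<- larger bootstrap sample than the first model
                            nodesize = 5,
                            importance = TRUE,
                            proximity = FALSE,
                            norm.votes = TRUE,
                            do.trace = TRUE,
                            keep.forest = TRUE,
                            keep.inbag = TRUE)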

census_RF_2
## 
## Call:
##  randomForest(formula = income ~ ., data = census_train, ntree = 93,      mtry = mtry_initial, replace = TRUE, sampsize = 300, nodesize = 5,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 93
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 14.96%
## Confusion matrix:
##       0    1 class.error
## 0 20432 1390  0.06369719
## 1  2915 4044  0.41888202

Comparing two models

Compare the new model to the original and discuss the differences

census_RF$confusion
##       0    1 class.error
## 0 20561 1261  0.05778572
## 1  3248 3711  0.46673373
census_RF_2$confusion
##       0    1 class.error
## 0 20432 1390  0.06369719
## 1  2915 4044  0.41888202
census_RF_2_acc = sum(census_RF_2$confusion[row(census_RF_2$confusion) == 
                                                col(census_RF_2$confusion)]) / 
  sum(census_RF_2$confusion)

census_RF_2_acc
## [1] 0.8504079

The OOB accuracy of the second model is 85.04%, 0.71 percentage points higher than the first model's 84.33%. Note from the Call output that the second model also raised sampsize from 100 to 300, so the improvement is not attributable to the smaller tree count alone.

Min OOB Model Analysis

Use the better-performing model to predict on your test set

census_predict = predict(census_RF_2,      #<- a randomForest model
                            census_test,      #<- the test data set to use
                            type = "response",   #<- what results to produce, see the help menu for the options
                            predict.all = TRUE,  #<- should the predictions of all trees be kept?
                            proximity = TRUE)
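
Because predict.all = TRUE and proximity = TRUE are set, the return value is a list rather than a bare vector: $predicted$aggregate holds the forest-level class calls used below, $predicted$individual the per-tree votes, and $proximity the proximity matrix. A quick way to confirm its shape:

str(census_predict, max.level = 2)   #<- top-level structure of the prediction object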

Use the confusion matrix function to assess the model quality

library(caret)   #<- provides confusionMatrix()

census_test_pred = data.frame(census_test, 
                                 Prediction = census_predict$predicted$aggregate)
confusionMatrix(census_test_pred$Prediction, census_test_pred$income, positive = "1", 
                dnn = c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    0    1
##          0 2330  310
##          1  131  426
##                                           
##                Accuracy : 0.8621          
##                  95% CI : (0.8496, 0.8738)
##     No Information Rate : 0.7698          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5745          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5788          
##             Specificity : 0.9468          
##          Pos Pred Value : 0.7648          
##          Neg Pred Value : 0.8826          
##               Precision : 0.7648          
##                  Recall : 0.5788          
##                      F1 : 0.6589          
##              Prevalence : 0.2302          
##          Detection Rate : 0.1332          
##    Detection Prevalence : 0.1742          
##       Balanced Accuracy : 0.7628          
##                                           
##        'Positive' Class : 1               
## 

Create a variable importance plot and discuss the results

census2_importance_DF <- as.data.frame(census_RF_2$importance)
kable(census2_importance_DF[-c(1:2)])
                 MeanDecreaseAccuracy   MeanDecreaseGini
age                         0.0118476          9.2277676
workclass                   0.0023134          4.6685858
fnlwgt                      0.0002341          8.0975631
education                   0.0154417          9.7668028
education_num               0.0142407          6.1403563
marital_status              0.0372957          9.7215352
occupation                  0.0157780         13.5910933
relationship                0.0315023         10.4430520
race                        0.0002394          0.8735149
sex                         0.0033935          1.1472097
capital_gain                0.0193705          7.3990921
capital_loss                0.0026479          2.2409120
hours_per_week              0.0047495          5.9747754
native_country             -0.0001427          1.9255412

varImpPlot(census_RF_2,     
           sort = TRUE,        
           n.var = 10,        
           main = "Important Factors for Identifying Income Range",
           #cex = 2,           #<- size of characters or symbols
           bg = "white",       
           color = "blue",     
           lcolor = "orange")

Occupation has the highest mean decrease in Gini index, which highlights its importance, while marital_status has the highest mean decrease in accuracy. Relationship is also a very important variable, ranking second on both measures.

Use the tuneRF function to select an optimal number of variables to use during the tree-building process. If necessary, rebuild your model and compare the results to the previous model.

set.seed(2)
census_RF_mtry = tuneRF(data.frame(census_train[ ,1:14]),  #<- predictor columns only
                           as.factor(census_train$income),     #<- target as a factor
                           mtryStart = 5,                      #<- starting mtry value
                           ntreeTry = 100,                     #<- trees grown per mtry tried
                           stepFactor = 2,                     #<- factor to step mtry by
                           improve = 0.05,                     #<- minimum relative improvement to keep searching
                           trace = TRUE,                       #<- print progress
                           plot = TRUE,                        #<- plot OOB error vs. mtry
                           doBest = FALSE)                     #<- return the error matrix, not a refit model
## mtry = 5  OOB error = 14.54% 
## Searching left ...
## mtry = 3     OOB error = 13.84% 
## 0.04804015 0.05 
## Searching right ...
## mtry = 10    OOB error = 14.93% 
## -0.02676864 0.05
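
Here mtry = 3 produced the lowest OOB error (13.84%), but its relative improvement over mtry = 5 (0.048) fell just short of the improve = 0.05 threshold, so the search stopped without recommending a change and the model was not rebuilt. Had a rebuild been warranted, a sketch with doBest = TRUE (census_RF_best is a hypothetical name) would refit and return the forest at the best mtry found:

census_RF_best = tuneRF(data.frame(census_train[ ,1:14]),
                           as.factor(census_train$income),
                           mtryStart = 5,
                           ntreeTry = 100,
                           stepFactor = 2,
                           improve = 0.05,
                           doBest = TRUE)   #<- refit and return the best forest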

Create a graphic that shows the size of the trees being grown for your final model.

hist(treesize(census_RF_2, terminal = TRUE), main = "Tree Size")   #<- terminal node counts for the final model
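
For a numeric companion to the histogram (a sketch; treesize() returns the terminal-node count of every tree in the forest):

summary(treesize(census_RF_2, terminal = TRUE))   #<- distribution of tree sizes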

Evaluate the final model using the performance/prediction functions in the ROCR package.

library(ROCR)

# OOB vote fraction for the positive class from the final model
census_RF_prediction = as.data.frame(as.numeric(as.character(census_RF_2$votes[,2])))
census_train_actual = data.frame(census_train[,15])   #<- the income column

census_prediction_comparison = prediction(census_RF_prediction, census_train_actual)

census_pred_performance = performance(census_prediction_comparison, 
                                         measure = "tpr",    #<- performance measure to use for the evaluation
                                         x.measure = "fpr")

census_rates = data.frame(fp = census_prediction_comparison@fp,  #<- false positive classification.
                             tp = census_prediction_comparison@tp,  #<- true positive classification.
                             tn = census_prediction_comparison@tn,  #<- true negative classification.
                             fn = census_prediction_comparison@fn) #<- false negative classification.

colnames(census_rates) = c("fp", "tp", "tn", "fn")

tpr = census_rates$tp / (census_rates$tp + census_rates$fn)
fpr = census_rates$fp / (census_rates$fp + census_rates$tn)

census_rates_comparison = data.frame(census_pred_performance@x.values,
                                        census_pred_performance@y.values,
                                        fpr,
                                        tpr)
colnames(census_rates_comparison) = c("x.values","y.values","fpr","tpr") #<- rename columns accordingly.

census_auc_RF = performance(census_prediction_comparison, 
                               "auc")@y.values[[1]]


# base graphics functions don't chain with "+"; call each one in sequence
plot(census_pred_performance, 
     col = "red", 
     lwd = 3, 
     main = "ROC curve")
grid(col = "black")
abline(a = 0, 
       b = 1,
       lwd = 2,
       lty = 2,
       col = "gray")
text(x = 0.5, 
     y = 0.5, 
     labels = paste0("AUC = ", 
                     round(census_auc_RF,
                           2)))

Summary

Our final model was the one built with the number of trees that minimized our OOB error. It reached an accuracy of .8621 on the test set, well above the no-information rate of .7698. Despite this high accuracy, the kappa value was a moderate .5745. Sensitivity and specificity offer more insight: the model struggled to predict the positive class, with a sensitivity of only .5788, while reaching a specificity of .9468. These values show that the model's accuracy was likely inflated. The dataset is imbalanced, with 24,283 individuals making less than $50k and only 7,695 making more, so the model's success on the dominant negative class outweighs its struggles on the positive class and inflates the overall accuracy. The AUC of .9 is likely flattered by the same imbalance. Looking at the Gini index and accuracy measures for each variable, occupation was clearly the most important, with education, age, and marital status also proving important.